Non-spatial data
Non-spatial data refers to data that does not have any inherent spatial or geographic component. It represents information that is not directly associated with specific locations or coordinates on the Earth’s surface. Non-spatial data can include various types of data, such as numerical values, text, categorical variables, dates, and more. Examples of non-spatial data include financial data, customer information, sales figures, weather measurements, text documents, and demographic statistics. Non-spatial data can be stored in different formats, such as spreadsheets, databases, CSV files, or text files.
In contrast to spatial data, which incorporates geographic attributes and spatial relationships, non-spatial data focuses on attributes and properties that are not directly tied to specific locations. However, non-spatial data can often be combined and analyzed with spatial data to gain insights and make informed decisions in various fields.
Reading tables into DataFrames with Pandas
Pandas is a widely used Python library that provides powerful and efficient data manipulation and analysis tools. It offers a comprehensive set of data structures, such as the DataFrames that we used in Lab 1. We will use pandas to read a dataset with GPS tracking data for domestic cats in Wellington, NZ. This dataset comes from the Cat Tracker NZ Project, a citizen science project set up to help discover the secret life of NZ cats. The project used GPS technology to track 233 cats, each for one week, between 2015 and 2017.
In the Lab2_data folder, which you downloaded from Learn, you will find a csv file named “pet_cats_nz.csv”. In your JupyterHub, inside your Lab2 folder, create a folder named Lab2_data and then upload “pet_cats_nz.csv” to that Lab2_data folder before attempting the next steps. Go back to Lab 1 section 4.3 if you do not remember how to upload data to JupyterHub.
Run the code below after uploading the data:
# Importing the pandas library and giving it the nickname pd; that way we do not need to type pandas every time we use a function from this library
import pandas as pd
# Using the read_csv function from pandas to read the csv file. We will store the result in a variable called cats_data
cats_data= pd.read_csv("Lab2_data/pet_cats_nz.csv")
C:\Users\vda14\AppData\Local\Temp\ipykernel_13956\4082554116.py:5: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
cats_data= pd.read_csv(r"Lab2_data\pet_cats_nz.csv")
In Python, placing the letter ‘r’ in front of a string creates a “raw string.” A raw string is a special type of string that treats backslashes as literal characters, rather than as escape characters. Ordinarily, in Python, backslashes are used as escape characters that have special meaning like newline or tab.
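A quick illustrative snippet (the file name here is made up) shows the difference:

```python
# In a normal string, \t is interpreted as a single tab character
normal = "Lab2_data\table.csv"
# In a raw string, the backslash and the t stay as two literal characters
raw = r"Lab2_data\table.csv"

print(len(normal))  # 18 - the \t collapsed into one tab character
print(len(raw))     # 19 - backslash and t counted separately
```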
As you see above, a DtypeWarning message was returned after reading the pet_cats_nz.csv file. This warning suggests that the data in column 6 of the CSV file contains mixed types, meaning that it contains values of different data types (e.g., strings, numbers) within the same column. Pandas, by default, tries to infer the data types of each column based on the values it encounters during the import process. However, when there are mixed types within a column, it can cause ambiguity in determining the appropriate data type for that column.
To address this warning, the warning message provides two possible solutions:
Specify dtype option on import: You can explicitly specify the data type of each column while importing the CSV file by using the dtype parameter of the pd.read_csv() function. By specifying the data type for column 6, you can ensure that Pandas assigns the correct data type to the column.
Set low_memory=False: Alternatively, you can set the low_memory parameter to False while reading the CSV file. By default, Pandas tries to optimize memory usage by inferring data types incrementally. However, when encountering mixed types, setting low_memory to False forces Pandas to allocate more memory for the import process, which can handle mixed type columns more effectively.
It is important to note that when working with mixed type columns, it is crucial to understand the nature of the data and handle it appropriately. Choosing the correct data type for each column is essential to ensure accurate data analysis and avoid potential errors or inconsistencies.
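As a sketch of both options, using a tiny in-memory CSV rather than the real file so the snippet stands alone (for the real data the same parameters would be passed to pd.read_csv with the pet_cats_nz.csv path):

```python
import io
import pandas as pd

# A small in-memory CSV mimicking a column with mixed True/False/missing values
csv_text = "id,flag\n1,True\n2,\n3,False\n"

# Option 1: declare the column's dtype up front so pandas does not have to guess
df = pd.read_csv(io.StringIO(csv_text), dtype={"flag": object})

# Option 2: read the whole file before inferring dtypes
df2 = pd.read_csv(io.StringIO(csv_text), low_memory=False)

print(df["flag"].dtype)  # object
```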
Slicing data with Pandas
Let’s explore column 6 to verify whether we need to reimport the data. We can select a specific column and/or row in a DataFrame by using their index and the .iloc property. By default, when a DataFrame or Series is created, Pandas assigns a numeric index starting from 0 to each row and column. The index can be thought of as a “label” or an identifier associated with each row/column. Usually we access columns by their names and rows by their indices. However, we do not know the name for column 6 yet, therefore we will use the .iloc property to access it.
The .iloc property uses the following syntax:
DataFrame.iloc[ from row number : to row number , from col number : to col number]
or
DataFrame.iloc[ row number, col number]
Let’s apply this logic to select all rows and column 6 of the cats_data DataFrame:
# Note that the row interval is empty, marked just by : . This is an easy way to denote all rows, especially if you do not know the exact number of rows.
cats_data.iloc[:,6]
0 False
1 False
2 False
3 False
4 False
...
406467 NaN
406468 NaN
406469 True
406470 NaN
406471 NaN
Name: manually-marked-outlier, Length: 406472, dtype: object
Retrieving unique values
The output above shows the column name (Name: manually-marked-outlier) and some of the values for that column: False, True and NaN. As we saw in Lab 1 Section 5, we can access data from a specific column by calling the dataframe name followed by the column name between square brackets and quotation marks. We will use that along with the pd.unique() function to perform a further check by retrieving all the unique (non-repeated) values for the “manually-marked-outlier” column.
pd.unique(cats_data['manually-marked-outlier'])
array([False, True, nan], dtype=object)
As you can see, we do in fact have mixed data types in that column. True and False are Boolean values (see Lab 1 section 2.3.3) and NaN is a special floating-point value (see Lab 1 section 2.3.2) used to represent missing or undefined numerical data. As this will not interfere with our analysis, we can keep going without importing the data again.
Now, use the property you learned in Lab 1 to show the first 10 lines of the cats_data DataFrame.
Revisit Lab 1 Section 5 if you do not remember the property you should use to show the first 10 lines of a DataFrame.
# Write your code to show the first 10 lines of the cats_data DataFrame here. Your output should look like what you see below.
| 0 | 713296458 | True | 2015-04-09 00:03:10.000 | 174.801407 | -41.203537 | NaN | False | gps | Felis catus | MurphyTag | Murphy | Pet Cats New Zealand |
| 1 | 713296459 | True | 2015-04-09 00:07:41.000 | 174.802032 | -41.203636 | NaN | False | gps | Felis catus | MurphyTag | Murphy | Pet Cats New Zealand |
| 2 | 713296460 | True | 2015-04-09 00:13:43.000 | 174.802032 | -41.203533 | NaN | False | gps | Felis catus | MurphyTag | Murphy | Pet Cats New Zealand |
| 3 | 713296461 | True | 2015-04-09 00:16:50.000 | 174.802261 | -41.203415 | NaN | False | gps | Felis catus | MurphyTag | Murphy | Pet Cats New Zealand |
| 4 | 713296462 | True | 2015-04-09 00:20:21.000 | 174.802094 | -41.203964 | NaN | False | gps | Felis catus | MurphyTag | Murphy | Pet Cats New Zealand |
| 5 | 713296463 | True | 2015-04-09 00:24:27.000 | 174.801559 | -41.202652 | NaN | False | gps | Felis catus | MurphyTag | Murphy | Pet Cats New Zealand |
| 6 | 713296464 | True | 2015-04-09 00:27:41.000 | 174.801788 | -41.203220 | NaN | False | gps | Felis catus | MurphyTag | Murphy | Pet Cats New Zealand |
| 7 | 713296465 | True | 2015-04-09 00:31:35.000 | 174.801865 | -41.203350 | NaN | False | gps | Felis catus | MurphyTag | Murphy | Pet Cats New Zealand |
| 8 | 713296466 | True | 2015-04-09 00:34:49.000 | 174.801880 | -41.204029 | NaN | False | gps | Felis catus | MurphyTag | Murphy | Pet Cats New Zealand |
| 9 | 713296467 | True | 2015-04-09 00:37:59.000 | 174.801788 | -41.203606 | NaN | False | gps | Felis catus | MurphyTag | Murphy | Pet Cats New Zealand |
Now, use the pd.unique() function and the individual-local-identifier column to retrieve the name of all cats participating in the study:
Revisit what we did a few code blocks above for cats_data[‘manually-marked-outlier’] if you are not sure how you should use the pd.unique() function and the individual-local-identifier column.
# Write here your code to show the name of all cats participating in the study. Your output should look like what you see below.
array(['Murphy', 'Neko', 'Niko', 'Panda', 'Pussums', 'Sky', 'Smudge',
'Tabitha', 'Taco', 'Teddy', 'Theo', 'Banzai', 'Blackie', 'Floss',
'Garfield', 'Gingernut', 'Jess', 'Jinx', 'Kiri', 'Mogadon',
'Molly', 'Muffin', 'Boots', 'BootsII', 'Greyskull2', 'Kitty',
'Allie', 'Ares', 'Ava', 'Fritz', 'Geordie', 'Speckles1', 'Babs1',
'Timmy1', 'Barnaby1', 'Tigger', 'Maximus1', 'Lucy', 'Snuffles',
'Poppy', 'Tyson', 'Diesel', 'Sophia', 'Huffer', 'Gracie', 'Monty',
'Zulu', 'Max', 'PoppyII', 'Minnie', 'Mabel', 'Colin', 'Yasas',
'Zeus', 'Delphi', 'Aslan', 'Nala', 'Kaos', 'Okie', 'Riggs',
'Sylvester', 'MontyII', 'Kyro', 'Charlie', 'MuffinII', 'George',
'Bailee', 'Apples', 'Pharos', 'Richie', 'HappyFeet', 'Carina',
'Samantha', 'Cornflake', 'Maddie', 'Maggie', 'Chilli',
'ArohaJools', 'Henry', 'Luna', 'Milkshakes', 'Oscar', 'Marley',
'Tina', 'Moka', 'Milo', 'Penny', 'Jingles', 'Snifer',
'BorisdeJong', 'Skittles', 'Bella', 'Ash', 'Ellie', 'Izzie',
'Ollie', 'Tiger', 'Nougie', 'Nimzy', 'Spud', 'McIvor', 'Bramble',
'Roger', 'Rosie', 'Sabbath', 'Presley', 'BellaII', 'Midnight',
'Percy', 'BellaIII', 'Lenny', 'Tasha', 'PiggieSue', 'Basil',
'Lorenzo', 'Lilly', 'Terabyte', 'Pushkin', 'Cookie', 'Arya',
'Buzz', 'Tenzing', 'Sativa', 'Moses', 'PoppyIII', 'Elliot',
'Gertie', 'SmudgyBum', 'Iggy', 'Dotti', 'Ewok', 'Catsby', 'Casper',
'Santi', 'Paloma', 'Maomao', 'Rogue', 'NekoII', 'Nugget', 'Dana',
'MarcellusWallace', 'RosieII', 'Astryl', 'Felix', 'Pepper',
'Willow', 'Xander', 'Tara', 'Stinkey', 'KittysaurusRex',
'TriceraCat', 'Alice', 'TamariAlmond', 'Cazador', 'ZeusII', 'Axel',
'Gin', 'Spice', 'Missie', 'Minerva', 'Selena', 'Aggie', 'Thomas',
'Katy', 'Jax', 'Stella', 'PoppyIV', 'Laila', 'Pablo', 'Merlin',
'Whiskey', 'Bonnie', 'NalaII', 'SkyII', 'Subie', 'Coco', 'Lux',
'Harvey', 'Vinnie', 'Guss', 'Gus', 'Ginge', 'Zeno', 'Todd',
'Polly', 'Sandman', 'Rocky', 'Bear', 'Pipi', 'Mungo',
'LemonSqueezy', 'Winston', 'Tabs', 'Marty', 'Schrodinger',
'Archie', 'Ruby', 'Alfie', 'Loki', 'Hazel', 'ThomasII', 'Tui',
'Kimy', 'Dash', 'Toby', 'Suss', 'Seb', 'Cutie-Girl',
'Shadow-Ninja', 'Dingo', 'Wilson', 'Buddy', 'Tilly', 'Marmite',
'GinII', 'Chess', 'Buster', 'MilliGal', 'CocoNelson',
'TigerNelson', 'BlueNelson', 'RedNelson', 'WillabelleNelson',
'PearlyNelson', 'FrostyNelson', 'Sooty', 'Sleeky', 'GusNelson',
'PussNelson', 'InkyNelson', 'GizmoNelson', 'JackNelson',
'CornwallLola'], dtype=object)
We can use the .nunique() function to confirm the number of cats we have in the dataset:
cats_data['individual-local-identifier'].nunique()
Now that we know there are 233 cats, let’s find out which cat has the most GPS points in the dataset using DataFrame['column'].value_counts():
# Count how many GPS points (rows) each cat has
cats_data['individual-local-identifier'].value_counts()
Luna 5151
Whiskey 4381
SkyII 3694
BellaII 3647
Penny 3630
...
Greyskull2 177
Aggie 156
Oscar 144
Barnaby1 98
Boots 11
Name: individual-local-identifier, Length: 233, dtype: int64
Now that we know that Luna is the cat with the most data, let’s create a new dataset just for her.
Selecting data based on attributes
Selecting data based on attributes allows us to filter and extract specific information that meets desired criteria. Attribute-based selection helps in data preprocessing and enables us to focus on specific attributes of interest. An attribute-based selection on a Pandas DataFrame uses the following syntax:
Subset_DataFrame = DataFrame[condition]
Conditions usually follow the syntax below:
DataFrame['column_name'] comparison_operator (==, >, <, etc) desired_value
Let’s use the structure and syntax explained above to create a dataset with GPS data only for the cat named Luna:
# Note that as the cat name (Luna) is a string we MUST place it between quotation marks.
cat= cats_data[cats_data['individual-local-identifier'] == "Luna"]
Inspect the result by calling the new DataFrame:
| 123031 | 1017621486 | True | 2015-09-19 12:02:09.000 | 175.050461 | -41.128876 | NaN | False | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand |
| 123032 | 1017621487 | True | 2015-09-19 12:05:18.000 | 175.050262 | -41.128883 | NaN | False | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand |
| 123033 | 1017621488 | False | 2015-09-19 12:08:27.000 | 175.050293 | -41.128960 | NaN | True | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand |
| 123034 | 1017621489 | False | 2015-09-19 12:12:06.000 | 175.050446 | -41.128845 | NaN | True | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand |
| 123035 | 1017621490 | False | 2015-09-19 12:15:45.000 | 175.050583 | -41.128906 | NaN | True | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 128177 | 1107850120 | False | 2015-12-11 11:37:27.000 | 175.050339 | -41.128876 | NaN | True | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand |
| 128178 | 1107850121 | False | 2015-12-11 11:45:27.000 | 175.050156 | -41.129082 | NaN | True | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand |
| 128179 | 1107850122 | False | 2015-12-11 11:51:08.000 | 175.050400 | -41.129005 | NaN | True | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand |
| 128180 | 1107850123 | False | 2015-12-11 11:57:13.000 | 175.050354 | -41.128963 | NaN | True | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand |
| 128181 | 1107850124 | False | 2015-12-11 12:02:36.000 | 175.050735 | -41.129272 | NaN | True | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand |
5151 rows × 12 columns
Our subset dataset for Luna has 5151 records and the same 12 columns as the original data (5151 rows × 12 columns).
Sorting by attribute
Now that our timeline column is in the correct data type, we can sort our DataFrame using .sort_values(). The argument by= indicates the column to use to order the DataFrame and the argument ascending=True indicates that we want to order from the smallest to the largest value. To order from largest to smallest we would use ascending=False. Sorting a pandas dataframe returns a dataframe with sorted values if inplace=False; if inplace=True, it returns None and modifies the original dataframe itself.
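If the timeline column has not been created yet in your notebook, a minimal sketch of how it could be built is below. Note the assumption: the raw timestamp column is taken to be named "timestamp" (the usual Movebank export name); adjust if your CSV differs.

```python
import pandas as pd

def add_timeline(df, ts_col="timestamp"):
    """Return a copy of df with a datetime 'timeline' column parsed from ts_col.
    ASSUMPTION: ts_col names the raw timestamp column in your data."""
    out = df.copy()
    out["timeline"] = pd.to_datetime(out[ts_col])
    return out

# In the lab this would be applied as: cat = add_timeline(cat)
```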
Let’s use .sort_values() to organise our original DataFrame from the earliest to the latest datetime in the timeline column:
cat.sort_values(by="timeline", ascending= True, inplace= True)
C:\Users\vda14\AppData\Local\Temp\ipykernel_38076\232428837.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
cat.sort_values(by="timeline", ascending= True, inplace= True)
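The warning appears because cat was created as a slice of cats_data, so pandas cannot tell whether the in-place sort should also affect the original DataFrame. A common remedy, sketched below on a small stand-in DataFrame, is to take an explicit .copy() when subsetting and to assign the sorted result instead of using inplace=True:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Luna", "Max", "Luna"], "t": [3, 1, 2]})

# Taking an explicit copy means later edits cannot touch a view of df
subset = df[df["name"] == "Luna"].copy()

# Assigning the result (rather than using inplace=True) also avoids the warning
subset = subset.sort_values(by="t", ascending=True)
print(subset["t"].tolist())  # [2, 3]
```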
# Line of code that you need to research to fully understand
cat.reset_index(inplace= True, drop= True)
# Inspecting the result
cat
| 0 | 1017621486 | True | 2015-09-19 12:02:09.000 | 175.050461 | -41.128876 | NaN | False | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand | 2015-09-19 12:02:09 |
| 1 | 1017621487 | True | 2015-09-19 12:05:18.000 | 175.050262 | -41.128883 | NaN | False | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand | 2015-09-19 12:05:18 |
| 2 | 1017621488 | False | 2015-09-19 12:08:27.000 | 175.050293 | -41.128960 | NaN | True | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand | 2015-09-19 12:08:27 |
| 3 | 1017621489 | False | 2015-09-19 12:12:06.000 | 175.050446 | -41.128845 | NaN | True | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand | 2015-09-19 12:12:06 |
| 4 | 1017621490 | False | 2015-09-19 12:15:45.000 | 175.050583 | -41.128906 | NaN | True | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand | 2015-09-19 12:15:45 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5146 | 1107850120 | False | 2015-12-11 11:37:27.000 | 175.050339 | -41.128876 | NaN | True | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand | 2015-12-11 11:37:27 |
| 5147 | 1107850121 | False | 2015-12-11 11:45:27.000 | 175.050156 | -41.129082 | NaN | True | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand | 2015-12-11 11:45:27 |
| 5148 | 1107850122 | False | 2015-12-11 11:51:08.000 | 175.050400 | -41.129005 | NaN | True | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand | 2015-12-11 11:51:08 |
| 5149 | 1107850123 | False | 2015-12-11 11:57:13.000 | 175.050354 | -41.128963 | NaN | True | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand | 2015-12-11 11:57:13 |
| 5150 | 1107850124 | False | 2015-12-11 12:02:36.000 | 175.050735 | -41.129272 | NaN | True | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand | 2015-12-11 12:02:36 |
5151 rows × 13 columns
So far our data is not in a spatial format. Although it includes geographic coordinates (location-long, location-lat), these have so far been treated as standard non-spatial attributes. We will use these columns to generate geographic data in the form of vectors.
Vector data
Vector data refers to a type of geospatial data that represents geometric objects using points, lines, and polygons. It is commonly used to represent features such as roads, buildings, and boundaries. Each object in vector data is defined by its geometry (coordinates) and can also have associated attributes such as names or numerical values. Shapely and Geopandas are the two main Python libraries for manipulating vector data.
Creating geometries using shapely
Shapely is a Python library for geometric operations and analysis. It simplifies complex computations on points, lines, and polygons, making it essential for GIS, mapping, and spatial data tasks. Shapely provides the fundamental geometric operations and data structures for Geopandas. Most of the time we will be working with Geopandas but understanding shapely structures is important for making the most of Geopandas.
In Python, the from library.module import statement is used to import specific functions, classes, or variables from a module into your code. This statement allows you to selectively import only the required elements from a module, rather than importing the entire module.
- from: It is a keyword that signals that you want to import specific elements from a module.
- library.module: This represents the name of the module from which you want to import elements. The module can be a built-in module, a standard library module, or a custom module you have created.
- import: It is a keyword used to specify that you want to import specific elements from the specified module.
After the import keyword, you can list the specific elements you want to import from the module. These elements can be functions, classes, or variables defined in the module. You can separate multiple elements with commas.
That is what we will do to import the fundamental geometry types implemented as Shapely objects:
from shapely.geometry import MultiPoint, Point, MultiLineString, LineString, MultiPolygon, Polygon
Now we can start creating some geometries. Let’s start by creating a point to represent the first GPS location registered for Luna in the cat DataFrame. First we need to select the x (longitude) and y (latitude) values for the first record in the cat DataFrame, which we sorted by timeline. We will use .loc[] because it allows us to select data based on column name and row index:
# Stores in x the value of the cell defined by the first row (index 0) and the column 'location-long'
x= cat.loc[0,'location-long']
# Stores in y the value of the cell defined by the first row (index 0) and the column 'location-lat'
y= cat.loc[0,'location-lat']
Use the print function to visualize the values of the x and y variables:
Revisit what we did in Lab 1 section 2.2 if you are not sure how to use the print() function.
# Write your code here to visualize the values of the `x` and `y` variables.
Point
Now we will use the x and y variables to create a Point object and store it in a variable named p1:
# Use the Point function to create your point object
p1=Point(x,y)
# Visualize the resulting object
p1
Let’s create another point representing the second location recorded in the cat DataFrame for Luna. This time we will select data directly instead of creating x and y variables.
# Use the Point function to create your point object
p2=Point(cat.loc[1,'location-long'],cat.loc[1,'location-lat'])
# Visualize the resulting object
p2
Use the same structure and logic as above to create and visualize another point (p3) representing the third location recorded in the cat DataFrame for Luna.
#Your code here. REMEMBER!!! PYTHON COUNTS FROM ZERO!
We can calculate the minimum Euclidean distance between two shapely objects by using the .distance() function. The .distance() function takes another geometric object as an argument and returns the minimum distance between the two objects. The distance returned is measured in the same units as the coordinate reference system of the objects.
Let’s calculate the distance between the first (p1) and the third point (p3) of Luna’s trajectory:
# Measuring the distance between p1 and p3
p1.distance(p3)
The distance between p1 and p3 is 0.00018782971011240446 decimal degrees. Distance is measured in decimal degrees because the location-long and location-lat columns, which we used to generate these points, are in decimal degrees. I know those are the units because 1) GPS data are often delivered in a geographic coordinate system, which uses degrees as units; and 2) based on my experience, just by looking at the format of the coordinate columns I can often tell the units. While a shapely object will not store metadata on the coordinate reference system and units, geopandas is a Python library that will help us with that. We will use geopandas later in this lab, but in order to fully understand geopandas you first need to understand shapely objects.
We can also extract x and y coordinates from a point object, using the .x and .y properties:
# Print the x and y coordinates for p1
print(p1.x, p1.y)
MultiPoint
A MultiPoint object is similar to a Point object in Shapely but can contain multiple points instead of just one. It is useful when you want to work with a group or set of points as a single entity. Each point within a MultiPoint is defined by its X and Y coordinates. The three points we have created so far (p1,p2 and p3) represent sequential positions for the same cat, Luna. Therefore we could group them into a MultiPoint object:
# Use the Point function to create a multipoint object that contains p1,p2 and p3.
# Note that a list of object [] is passed to MultiPoint()
mp_luna= MultiPoint([p1,p2,p3])
# Visualize the resulting object
mp_luna
The output above makes it look like we only have two points in that MultiPoint object. That happens because two of those points are so close that they overlap when displayed. We can perform a further inspection by using the print() function to display the content of mp_luna:
# Write here your code to show the content of mp_luna using the print() function. We have used this function before. Your output should look like what you see below.
MULTIPOINT (175.050461 -41.128876, 175.050262 -41.128883, 175.050293 -41.12896)
As you can see above, we have three pairs of x and y coordinates separated by commas. Therefore all three points were successfully included in our mp_luna MultiPoint object.
The .geoms attribute in Shapely is used to access the individual geometric objects within a collection. It is available for geometric objects that can contain multiple sub-geometries, such as MultiPoint, MultiLineString, and MultiPolygon. We can use it in combination with the geometry index to access to the individual geometries within a collection. Let’s access the first point of mp_luna:
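For example (re-creating mp_luna from the coordinates printed above so the snippet stands on its own):

```python
from shapely.geometry import MultiPoint, Point

mp_luna = MultiPoint([Point(175.050461, -41.128876),
                      Point(175.050262, -41.128883),
                      Point(175.050293, -41.12896)])

# .geoms with index 0 returns the first sub-geometry, a Point
first = mp_luna.geoms[0]
print(first.x, first.y)  # 175.050461 -41.128876
```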
Similarly to what we did for our points, we can also extract x and y coordinates from each point within a MultiPoint object using the .x and .y properties combined with the .geoms attribute. For that, however, we need to iterate over the objects within the collection of geometries (MultiPoint, MultiLineString, MultiPolygon). There are at least two ways of doing this. Let’s test the first one, which uses a for loop:
# You can read this as : For any element p in mp_luna.geoms
# Show their x coordinates
for p in mp_luna.geoms:
    print(p.x)
175.050461
175.050262
175.050293
List comprehension
The second, and most common, way of doing this uses list comprehension. List comprehension is a concise and powerful feature in Python that allows you to create new lists by performing operations on existing iterables, such as lists, tuples, or strings. It provides a compact syntax for generating lists based on specified conditions or transformations, without the need for explicit loops like the one we used above.
The basic syntax of a list comprehension is as follows:
[expression for item in iterable if condition]
- expression: It represents the operation or transformation to be applied to each item in the iterable. This expression generates the elements of the new list.
- item: It represents the variable that takes each item of the iterable in each iteration.
- iterable: It is the existing sequence over which the iteration occurs, such as a list, tuple, or string.
- if condition (optional): It is an optional conditional statement that filters the items from the iterable based on a specified condition. Only the items that satisfy the condition are included in the new list.
Let’s use list comprehension to extract the x coordinates from each point in mp_luna:
# Here p.x is our expression, p is our item and mp_luna.geoms is the iterable.
[p.x for p in mp_luna.geoms]
[175.050461, 175.050262, 175.050293]
Use list comprehension, as above, to extract the y coordinates from each point in mp_luna:
# Write here your code here. Your output should look like what you see below.
[-41.128876, -41.128883, -41.12896]
Lines
In Shapely, lines are represented by the LineString object. A LineString is defined by a collection of points (vertices) in a specific order, where each consecutive pair of points forms a line segment. Let’s create a LineString representing Luna’s path between p1,p2 and p3:
# Note that the points are passed inside a list []. If we were to create the line from coordinates, we would use
# the following structure [(x1,y1),(x2,y2),(x3,y3)]
path_luna= LineString([p1,p2,p3])
path_luna
# Note that the LINESTRING is made of pairs of coordinates (x,y) that came from the points (p1,p2,p3) we gave as input.
print(path_luna)
LINESTRING (175.050461 -41.128876, 175.050262 -41.128883, 175.050293 -41.12896)
We can check the distance covered by Luna’s path using the length property:
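A sketch of this check, re-creating path_luna from the coordinates above so it runs on its own:

```python
from shapely.geometry import LineString

path_luna = LineString([(175.050461, -41.128876),
                        (175.050262, -41.128883),
                        (175.050293, -41.12896)])

# .length is the sum of the lengths of the line's segments,
# here in decimal degrees because the coordinates are in decimal degrees
print(path_luna.length)
```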
Once again, the distance is being measured in decimal degrees because the location-long and location-lat columns are in decimal degrees. We will learn how to change that later.
MultiLine
The MultiLineString object represents a collection of LineString objects. It is used to store and manipulate multiple line segments as a single geometric entity. We can create a MultiLineString object to represent Luna’s paths for two different days for example. To do that we first need to select data for the days we want:
# First we check the data again to identify which column we can use to select by day
cat.head(3)
| 0 | 1017621486 | True | 2015-09-19 12:02:09.000 | 175.050461 | -41.128876 | NaN | False | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand | 2015-09-19 12:02:09 |
| 1 | 1017621487 | True | 2015-09-19 12:05:18.000 | 175.050262 | -41.128883 | NaN | False | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand | 2015-09-19 12:05:18 |
| 2 | 1017621488 | False | 2015-09-19 12:08:27.000 | 175.050293 | -41.128960 | NaN | True | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand | 2015-09-19 12:08:27 |
The timeline column includes YYYY-MM-DD HH:MM:SS. As we want to select by day, we first need to obtain all the unique (non-repeated) days in that column. We can use .dt.date to extract only the date for each item in a Pandas data series (a column in a pandas dataframe is a pandas data series). Then we use pd.unique to return the non-repeated days:
# The code below follows the following structure pd.unique( dataframe.timestamp_column.dt.date)
days=pd.unique(cat.timeline.dt.date)
# Inspect all results to see the days included in the data
days
array([datetime.date(2015, 9, 19), datetime.date(2015, 9, 20),
datetime.date(2015, 9, 21), datetime.date(2015, 9, 22),
datetime.date(2015, 9, 23), datetime.date(2015, 9, 24),
datetime.date(2015, 9, 25), datetime.date(2015, 9, 26),
datetime.date(2015, 9, 27), datetime.date(2015, 9, 28),
datetime.date(2015, 12, 4), datetime.date(2015, 12, 5),
datetime.date(2015, 12, 6), datetime.date(2015, 12, 7),
datetime.date(2015, 12, 8), datetime.date(2015, 12, 9),
datetime.date(2015, 12, 10), datetime.date(2015, 12, 11)],
dtype=object)
# We can use indices to select a specific day from the variable days
# Here we use [0] to select the first day in the list
days[0]
datetime.date(2015, 9, 19)
Knowing that, now we can use attribute selection to create two new dataframes with data only for the first and second days in the dataset:
# Creating a dataframe for the first day by selecting only rows where timeline.dt.date == days[0], i.e. timeline.dt.date == 19/09/2015
day_1= cat[cat.timeline.dt.date == days[0]]
# Checking the number of rows found for that day, i.e. the number of GPS locations collected on that day
day_1.shape[0]
# Creating a dataframe for the second day by selecting only rows where timeline.dt.date == days[1], i.e. timeline.dt.date == 20/09/2015
day_2= cat[cat.timeline.dt.date == days[1]]
# Checking the number of rows found for that day, i.e. the number of GPS locations collected on that day
day_2.shape[0]
# Check the dataset
day_1.head(3)
| 0 | 1017621486 | True | 2015-09-19 12:02:09.000 | 175.050461 | -41.128876 | NaN | False | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand | 2015-09-19 12:02:09 |
| 1 | 1017621487 | True | 2015-09-19 12:05:18.000 | 175.050262 | -41.128883 | NaN | False | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand | 2015-09-19 12:05:18 |
| 2 | 1017621488 | False | 2015-09-19 12:08:27.000 | 175.050293 | -41.128960 | NaN | True | gps | Felis catus | LunaTag | Luna | Pet Cats New Zealand | 2015-09-19 12:08:27 |
The zip() function
To create a MultiLineString object representing Luna’s paths for both days we need to combine the x (location-long) and y (location-lat) coordinates into tuples (see Lab 1 section 2.6.4 for more on tuples). We can use the zip() function to combine elements from multiple iterables (such as lists, tuples, strings and data series) into tuples. In Python 3, zip() returns an iterator, which provides a way to access the elements of a collection or data stream sequentially, pairing them together based on their respective positions. The basic syntax of the zip() function is as follows:
zip(iterable1, iterable2, ...)
Now see how it works with a schematic example:
long = [1, 2, 3]
lat = [5, 4, 12]
zipped = zip(long, lat)
for pair in zipped:
    print(pair)
# Output:
(1, 5)
(2, 4)
(3, 12)
Now that you understand how it works, let’s use the zip() function to create individual LineStrings for each day and then create a MultiLineString object for both days and compare those:
# Creating a LineString for day one.
LineString(zip(day_1['location-long'], day_1['location-lat']))
# Creating a LineString for day two.
LineString(zip(day_2['location-long'], day_2['location-lat']))
# Now we create the MultiLineString object. Note that the zipped coordinates for day 1 and day 2 are passed as a list [] and separated by a comma
path_d1d2= MultiLineString([zip(day_1['location-long'], day_1['location-lat']),zip(day_2['location-long'], day_2['location-lat'])])
# Inspect the result
path_d1d2
In the same way we did for LineString, we can check the distance covered by Luna’s paths using the length property:
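For example, a minimal sketch with two made-up segments (not real GPS data) shows that length on a MultiLineString returns the summed length of all its parts:

```python
from shapely.geometry import MultiLineString

# Two illustrative segments: one of length 5 (a 3-4-5 right triangle hypotenuse) and one of length 2
paths = MultiLineString([[(0, 0), (3, 4)], [(0, 0), (0, 2)]])

# .length sums the lengths of all component lines
print(paths.length)  # 7.0
```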
The distance above is the sum of the distances covered for each daily path. If we want the distance covered on each day, we need to use a list comprehension again, together with geoms:
print([x.length for x in path_d1d2.geoms])
[0.08204094490944104, 0.08096749198530914]
Polygon
The Polygon object is defined by a closed ring of points, where each consecutive pair of points represents an edge of the polygon. The last point in the ring is implicitly connected to the first point, forming a closed shape. We can create a polygon to represent the bounding box that encompasses Luna’s trajectories for all days. A bounding box is a rectangular region that completely encloses an object or a collection of objects. To define this bounding box we first need to find the minimum and maximum x and y coordinates for Luna’s GPS data:
# Finding the min and max for x (long) and y (lat) coordinates
xmin, ymin, xmax, ymax = cat['location-long'].min(), cat['location-lat'].min(), cat['location-long'].max(), cat['location-lat'].max()
# Inspecting these values
print(xmin, ymin, xmax, ymax)
174.786041 -41.132217 175.066498 -41.120335
Multiple assignment, also known as simultaneous assignment or tuple unpacking, is a feature in Python that allows you to assign values to multiple variables in a single statement. It provides a convenient way to assign values to several variables at once, using a comma-separated list of values on the right-hand side of the assignment operator (=), like we just did above with:
xmin, ymin, xmax, ymax = cat['location-long'].min(), cat['location-lat'].min(), cat['location-long'].max(), cat['location-lat'].max()
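A schematic example with made-up values:

```python
# Three values assigned to three variables in a single statement
xmin, ymin, label = 174.79, -41.20, "example"
print(xmin, ymin, label)

# Python packs the right-hand side into a tuple and unpacks it on the left,
# so two variables can even be swapped without a temporary variable:
a, b = 1, 2
a, b = b, a
print(a, b)  # 2 1
```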
Now that we have the required coordinates, we can create the polygon representing the bounding box for Luna’s GPS locations:
# Creates the bounding box polygon by passing a list [] of tuples () with coordinates that define each corner
polygon = Polygon([(xmin,ymin),(xmin,ymax),(xmax,ymax), (xmax,ymin)])
#Inspect the result
polygon
Inspect the structure of the resulting object:
POLYGON ((174.786041 -41.132217, 174.786041 -41.120335, 175.066498 -41.120335, 175.066498 -41.132217, 174.786041 -41.132217))
We can calculate the area of this bounding box, or of any polygon, by using the area property:
# Print the area of the polygon
print(polygon.area)
We can calculate the perimeter of this bounding box, or of any polygon, by using the length property:
# Print the perimeter of the polygon
print(polygon.length)
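As a quick sanity check, for a made-up 3 x 2 rectangle the two properties return the familiar planar values:

```python
from shapely.geometry import Polygon

# A 3 x 2 rectangle (corners listed counter-clockwise; shapely closes the ring)
rect = Polygon([(0, 0), (3, 0), (3, 2), (0, 2)])

print(rect.area)    # 6.0  (3 * 2)
print(rect.length)  # 10.0 (perimeter: 2 * (3 + 2))
```

Remember that our bounding box coordinates are in degrees, so its area and perimeter come out in squared degrees and degrees, not metres.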
MultiPolygon
A MultiPolygon object is a collection of polygons, each defined by its own exterior ring and, optionally, interior rings (also known as holes) that mark areas excluded from the shape. We can create a multipolygon that combines the bounding boxes of Luna’s trajectories on two different days. To define these bounding boxes we first need to find the minimum and maximum x and y coordinates of Luna’s GPS data for day_1 and day_2:
# Finding the min and max for x (long) and y (lat) coordinates for day_1
xmin, ymin, xmax, ymax = day_1['location-long'].min(), day_1['location-lat'].min(), day_1['location-long'].max(), day_1['location-lat'].max()
# Inspecting these values
print(xmin, ymin, xmax, ymax)
175.04953 -41.129704 175.066498 -41.120335
Now that we have the required coordinates, we can create the polygon representing the bounding box for Luna’s GPS locations in day_1:
# Creates the bounding box polygon by passing a list [] of tuples () with coordinates that define each corner
poly_d1= Polygon([(xmin,ymin),(xmin,ymax),(xmax,ymax), (xmax,ymin)])
#Inspect the result
poly_d1
Use the same logic and steps as in the previous two cells of code to create a polygon (named poly_d2) that represents the bounding box for Luna’s GPS locations in day_2:
# WRITE YOUR CODE HERE. YOUR OUTPUT SHOULD BE WHAT YOU SEE BELOW:
175.049637 -41.129341 175.055054 -41.128326
Now we can use poly_d1 and poly_d2 to create a multipolygon object that includes the individual bounding boxes for Luna’s GPS locations in day_1 and day_2:
# Note that polygons are passed as a list [] of objects
mpl = MultiPolygon([poly_d1, poly_d2])
# Inspect the object structure
print(mpl)
#Visualize the resulting polygons
mpl
MULTIPOLYGON (((175.04953 -41.129704, 175.04953 -41.120335, 175.066498 -41.120335, 175.066498 -41.129704, 175.04953 -41.129704)), ((175.049637 -41.129341, 175.049637 -41.128326, 175.055054 -41.128326, 175.055054 -41.129341, 175.049637 -41.129341)))
We can calculate the area of these bounding boxes (summed up) by using the property area:
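The cell for that step is not reproduced above, but the behaviour can be sketched with two made-up, non-overlapping boxes; area on a MultiPolygon returns the sum of the component areas:

```python
from shapely.geometry import Polygon, MultiPolygon

box_a = Polygon([(0, 0), (2, 0), (2, 1), (0, 1)])  # area 2
box_b = Polygon([(5, 5), (6, 5), (6, 6), (5, 6)])  # area 1

mp = MultiPolygon([box_a, box_b])
print(mp.area)  # 3.0 = 2.0 + 1.0
```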
In order to calculate the area for each individual bounding box we need to combine the properties area and geoms:
[x.area for x in mpl.geoms]
[0.00015897319199991155, 5.49825499999767e-06]
For more information on how to use shapely, please refer to the official documentation: https://shapely.readthedocs.io/en/stable/manual.html
Now that you know all the basic shapely geometric objects, we will pick up the pace by extending Shapely with GeoPandas. GeoPandas builds on Shapely by providing a geospatial data abstraction that works seamlessly with pandas DataFrames. Managing geospatial data in tabular format becomes more efficient, and it allows associating geometries with attribute data.
Creating vector data using Geopandas
Geopandas is a Python library that enhances data handling and analysis for geospatial datasets. Built on the Pandas and Shapely libraries, Geopandas extends Pandas functionalities to include spatial geometry support by adding a geometry column that stores shapely objects. The data structure that combines geometries from shapely and DataFrame attributes from Pandas allowing for spatial operations and queries is called a GeoDataFrame. Geopandas supports reading and writing various geospatial vector file formats, such as .shp and .gpkg, facilitating data integration. By integrating with Matplotlib and other visualization libraries, it also enables the creation of maps and visualizations.
We will use geopandas to create a spatial dataset including all cats in the NZ cat tracking study and perform some basic spatial analysis. First, let’s take another look at our Pandas DataFrame with information for all cats, the one we imported earlier.
|   | event-id | visible | timestamp | location-long | location-lat | algorithm-marked-outlier | manually-marked-outlier | sensor-type | individual-taxon-canonical-name | tag-local-identifier | individual-local-identifier | study-name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 713296458 | True | 2015-04-09 00:03:10.000 | 174.801407 | -41.203537 | NaN | False | gps | Felis catus | MurphyTag | Murphy | Pet Cats New Zealand |
| 1 | 713296459 | True | 2015-04-09 00:07:41.000 | 174.802032 | -41.203636 | NaN | False | gps | Felis catus | MurphyTag | Murphy | Pet Cats New Zealand |
| 2 | 713296460 | True | 2015-04-09 00:13:43.000 | 174.802032 | -41.203533 | NaN | False | gps | Felis catus | MurphyTag | Murphy | Pet Cats New Zealand |
| 3 | 713296461 | True | 2015-04-09 00:16:50.000 | 174.802261 | -41.203415 | NaN | False | gps | Felis catus | MurphyTag | Murphy | Pet Cats New Zealand |
| 4 | 713296462 | True | 2015-04-09 00:20:21.000 | 174.802094 | -41.203964 | NaN | False | gps | Felis catus | MurphyTag | Murphy | Pet Cats New Zealand |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 406467 | 11918524216 | True | 2017-06-29 20:34:58.000 | -5.073875 | 50.147003 | NaN | NaN | gps | Felis catus | CornwallLolaTag | CornwallLola | Pet Cats New Zealand |
| 406468 | 11918524217 | True | 2017-06-29 20:40:09.000 | -5.073789 | 50.146812 | NaN | NaN | gps | Felis catus | CornwallLolaTag | CornwallLola | Pet Cats New Zealand |
| 406469 | 11918524218 | False | 2017-06-29 20:47:51.000 | -5.073850 | 50.142307 | NaN | True | gps | Felis catus | CornwallLolaTag | CornwallLola | Pet Cats New Zealand |
| 406470 | 11918524219 | True | 2017-06-29 20:56:44.000 | -5.073898 | 50.145592 | NaN | NaN | gps | Felis catus | CornwallLolaTag | CornwallLola | Pet Cats New Zealand |
| 406471 | 11918524220 | True | 2017-06-29 21:04:40.000 | -5.073836 | 50.145512 | NaN | NaN | gps | Felis catus | CornwallLolaTag | CornwallLola | Pet Cats New Zealand |

406472 rows × 12 columns
Now we use the GeoDataFrame function to create a GeoDataFrame based on the Pandas DataFrame cats_data (non-spatial) that we already have:
# First import the geopandas library as we have not used it before
import geopandas as gpd
# Now we create a GeoDataFrame (cat_gdf; gdf is a common abbreviation for GeoDataFrame) from the DataFrame cats_data, specifying the geometry column (geometry=) as points (gpd.points_from_xy()) created from
# the 'location-long' and 'location-lat' columns. The "crs=" argument sets the coordinate reference system; EPSG:4326 is the code for the WGS 84 latitude/longitude coordinate system.
cat_gdf= gpd.GeoDataFrame(cats_data, geometry=gpd.points_from_xy(cats_data['location-long'],cats_data['location-lat']), crs="EPSG:4326")
# Inspect the result
cat_gdf.head(3)
|   | event-id | visible | timestamp | location-long | location-lat | algorithm-marked-outlier | manually-marked-outlier | sensor-type | individual-taxon-canonical-name | tag-local-identifier | individual-local-identifier | study-name | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 713296458 | True | 2015-04-09 00:03:10.000 | 174.801407 | -41.203537 | NaN | False | gps | Felis catus | MurphyTag | Murphy | Pet Cats New Zealand | POINT (174.80141 -41.20354) |
| 1 | 713296459 | True | 2015-04-09 00:07:41.000 | 174.802032 | -41.203636 | NaN | False | gps | Felis catus | MurphyTag | Murphy | Pet Cats New Zealand | POINT (174.80203 -41.20364) |
| 2 | 713296460 | True | 2015-04-09 00:13:43.000 | 174.802032 | -41.203533 | NaN | False | gps | Felis catus | MurphyTag | Murphy | Pet Cats New Zealand | POINT (174.80203 -41.20353) |
We can compare the data structures by using the function type() to display the type of data structure for the original data (cats_data) and the one we just created (cat_gdf):
print(type(cats_data))
print(type(cat_gdf))
<class 'pandas.core.frame.DataFrame'>
<class 'geopandas.geodataframe.GeoDataFrame'>
As you can see above, these datasets belong to different classes. In Python, a class is like a template for creating objects. It is a fundamental concept in object-oriented programming (OOP), which is a paradigm for writing code that allows modelling real-world entities as well as abstract concepts. A class defines the attributes (data) and methods (functions) that an object of that class will have.
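A minimal illustration (the Cat class below is invented for this example):

```python
class Cat:
    """A tiny class: attributes store data, methods define behaviour."""

    def __init__(self, name, n_points):
        self.name = name          # attribute (data)
        self.n_points = n_points  # attribute (data)

    def describe(self):           # method (function)
        return f"{self.name} has {self.n_points} GPS fixes"

luna = Cat("Luna", 3)  # luna is an object, i.e. an instance of the class Cat
print(type(luna))
print(luna.describe())  # Luna has 3 GPS fixes
```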
.explore is a very useful method (function) available for GeoDataFrames (see the documentation for more details). It allows the creation of interactive maps, which, amongst many other things, are great for quick visualizations at the early stages of data cleaning and pre-processing. We are currently working with a relatively large dataset (406472 rows x 12 columns); for that reason, and to speed up the plotting, we will not use the standard syntax for .explore:
Instead, we will pre-select the geometry column to minimize loading time by using the following syntax:
GeoDataFrame['geometry'].explore()
The column storing the shapely objects is named “geometry” by default; keeping that default name avoids extra configuration later.
Run the code below to visualize our cat data. The code might take a few seconds to run, but once it finishes you should have an interactive version of the map shown below. I could not embed all the interactive maps in this document because the file was becoming too large; therefore, I have replaced most dynamic maps with static figures for your reference only. Take some time to zoom in, zoom out and explore the map. Once you are done exploring, make sure to comment out (#) and rerun the cell below to free up memory, as this is a heavy dataset. Otherwise your Jupyter Notebook might become super slow or even crash:
cat_gdf["geometry"].explore()
As you can see from the map above, there are some GPS points in the UK, which is unexpected considering that this is a NZ study. These points in the UK are most likely outliers caused by GPS error due to, for example, issues with satellite signal or atmospheric interference. Luckily, we have a column in the dataset (‘manually-marked-outlier’) that indicates whether each location was considered an outlier or not. Let’s first check how many outliers we have in this dataset:
# Here we use a selection by attribute to retrieve only rows that were marked as outliers (['manually-marked-outlier']==True) and then use the property .shape[0] to retrieve the number of rows selected.
# GeoDataFrame.shape retrieves the number of (rows, columns) but we can access only rows by using the index [0] or only columns by using the index [1] or both by just using .shape
(cat_gdf[cat_gdf['manually-marked-outlier']==True]).shape[0]
Now we know that we have 375670 outliers, which is a lot in a dataset with 406472 points. In this case, we will trust the specialists who were working with these data, as they know best the limitations of the GPS equipment being used. Therefore, we will update our variable cat_gdf so that this GeoDataFrame only keeps the points that were not manually marked as outliers:
# Selecting only rows (points) that were not marked as outliers
cat_gdf=cat_gdf[cat_gdf['manually-marked-outlier']==False]
# Checking the number of remaining rows
cat_gdf.shape[0]
We now have 29684 points that were considered accurate by the researchers using this dataset. Use what we learned previously in this lab to display an interactive map and visualize the selected data:
# YOUR CODE HERE. YOUR OUTPUT SHOULD BE AN INTERACTIVE VERSION OF WHAT YOU SEE BELOW
The lambda function
In the compact version above we used something new, a lambda function. In Python, a lambda function is a concise way of creating small, anonymous functions. It allows you to define a simple function in a single line without giving it a formal name. Lambda functions are often used when you need a quick function for a short period and don’t want to define a regular named function using the def keyword.
The syntax of a lambda function is as follows:
lambda arguments: expression
where:
- lambda is the keyword that marks the start of a lambda function.
- arguments are the input parameters of the function, similar to regular function arguments.
- expression is the single expression that the lambda function evaluates and returns.
In our “compact, all at once and very efficient” version above those were:
- lambda: lambda
- arguments: x, where x is each group resulting from cat_loc_gdf.groupby(['individual-local-identifier'], as_index=False)['geometry']
- expression: MultiPoint(x.tolist())
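Since the full dataset is not reproduced in this document, the grouping pattern can be sketched with made-up points, using plain pandas plus shapely:

```python
import pandas as pd
from shapely.geometry import Point, MultiPoint

# Made-up GPS fixes for two hypothetical cats
df = pd.DataFrame({
    "individual-local-identifier": ["Luna", "Luna", "Milo"],
    "geometry": [Point(0, 0), Point(1, 1), Point(2, 2)],
})

# For each cat, collect all of its Points into a single MultiPoint
multi = df.groupby("individual-local-identifier")["geometry"].apply(
    lambda x: MultiPoint(x.tolist())
)
print(multi["Luna"])
```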
Here’s an example of a simple lambda function that calculates the square of each row in a column A in a given DataFrame df:
# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3]})
print("Original data:")
print(df['A'])
# Use .apply() with lambda to square each element of the DataFrame
squared_df = df['A'].apply(lambda x: x**2)
# Check the results
print("Squared data:")
print(squared_df)
Original data:
0 1
1 2
2 3
Name: A, dtype: int64
Squared data:
0 1
1 4
2 9
Name: A, dtype: int64
Defining coordinate reference systems (CRS)
A coordinate reference system (CRS) defines how spatial data is represented on a map. It specifies a mathematical framework to convert real-world geographic coordinates into projected coordinates on a flat surface, enabling accurate spatial analysis and visualization when defined correctly. However, things can look really weird and displaced when a CRS is not defined or is incorrect. Let’s take a look at the multipoint dataset we created for all cats by visualizing it:
cat_multi["geometry"].explore()
Oh-Oh!!! We cannot see the background map anymore! It is all grey and as if our GeoDataFrame cat_multi were no longer linked to any location on Earth! This happens more often than we would like to be honest, but the good news is that there are easy ways to find out why and fix it!
Let’s start by using the .crs property to verify the coordinate reference system adopted for the GeoDataFrame cat_multi:
Here we go! The problem is that there is no Coordinate Reference System (CRS) defined, that is why .crs returns None as a result. Defining a CRS can be metaphorically compared to defining a dictionary for a language in the context of a book. Imagine you have a book written in a language with a unique alphabet and grammar rules. Without a dictionary, you may struggle to understand the meaning of certain words or phrases. The dictionary acts as a reference guide, providing definitions and translations for words, enabling you to interpret and comprehend the content.
Similarly, a CRS serves as a reference guide for geospatial data. Geographic coordinates without a CRS are just numbers, like words without definitions, lacking context and meaning. Setting a CRS establishes a standardized framework that translates real-world geographic locations into projected or geographic coordinates on a map. This ensures that spatial data is accurately positioned, allowing you to navigate, analyze, and visualize the data correctly, just as a dictionary enables you to understand and interpret the language in a book.
We can define a CRS for our data using the function set_crs().
The crs argument is defined by an EPSG code (see here for more on EPSG codes) that uniquely identifies a CRS. Here we use ‘EPSG:4326’ because this is the EPSG code for the lat-long geographic WGS 84 CRS, in which our GPS data (and most GPS data) were collected.
# The argument inplace=True indicates that the original geodataframe `cat_multi` must be modified in place
# If we do not use inplace=True we would need to do new_variable = cat_multi.set_crs(crs='EPSG:4326')
cat_multi.set_crs(crs='EPSG:4326', inplace=True)
|   | individual-local-identifier | geometry |
|---|---|---|
| 0 | Aggie | MULTIPOINT (174.77098 -41.25353, 174.77107 -41... |
| 1 | Alfie | MULTIPOINT (174.79213 -41.24120, 174.79213 -41... |
| 2 | Alice | MULTIPOINT (175.07092 -40.90078, 175.07121 -40... |
| 3 | Allie | MULTIPOINT (174.76271 -41.26717, 174.76289 -41... |
| 4 | Apples | MULTIPOINT (174.79509 -41.29774, 174.79509 -41... |
| ... | ... | ... |
| 227 | Yasas | MULTIPOINT (174.98343 -40.91526, 174.98349 -40... |
| 228 | Zeno | MULTIPOINT (174.78447 -41.31677, 174.78447 -41... |
| 229 | Zeus | MULTIPOINT (174.76651 -41.26687, 174.76608 -41... |
| 230 | ZeusII | MULTIPOINT (174.79810 -41.32442, 174.79819 -41... |
| 231 | Zulu | MULTIPOINT (174.80708 -41.22766, 174.80745 -41... |

232 rows × 2 columns
Now let’s check the CRS information again:
# YOUR CODE HERE TO CHECK THE CRS INFORMATION. YOUR OUTPUT SHOULD BE LIKE THE ONE BELOW. IF YOU DO NOT REMEMBER HOW, WE JUST DID IT A FEW BLOCKS OF CODE ABOVE.
We can obtain all the parameters for the CRS, as shown below:
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
Now that we defined the CRS, we should be able to visualize our data with the base map. We will also add a new argument (column) in order to display our data with a different colour for each cat:
# We use column= 'individual-local-identifier' to color by cat name and
# legend = False to hide the legend as this would have more than 200 items, which
# would make it difficult to read and see the map
cat_multi.explore(column= 'individual-local-identifier', legend = False)
Great! It works now and it is in the correct location. However, we still have data for cats that reside outside of Wellington. As we are only interested in analysing cats living in Wellington, we need to perform some spatial operation to select only data within the boundaries for Wellington territorial authority (TA). In order to do that, we first need to import and read spatial data on NZ’s TAs.
Reading vector data using Geopandas
GeoPandas can read vector data in formats like Shapefile, GeoJSON, GeoPackage, KML, GML, WKT, and CSV, facilitating versatile geospatial data analysis and manipulation. In the Lab2_data folder, which you downloaded from Learn, you will find a geopackage named “territorial-authority-2021-generalised.gpkg”. In your JupyterHub, upload “territorial-authority-2021-generalised.gpkg” to the Lab2_data folder before attempting the next steps. Go back to Lab 1 section 4.3 if you do not remember how to upload data to JupyterHub.
Run the code below after uploading the data:
# Reading the geopackage with limits for NZ's TAs into a geodataframe
ta = gpd.read_file("Lab2_data/territorial-authority-2021-generalised.gpkg")
# Inspecting the first three lines
ta.head(3)
|   | TA2021_V1_00 | TA2021_V1_00_NAME | TA2021_V1_00_NAME_ASCII | LAND_AREA_SQ_KM | AREA_SQ_KM | Shape_Length | geometry |
|---|---|---|---|---|---|---|---|
| 0 | 032 | Central Hawke's Bay District | Central Hawke's Bay District | 3333.140233 | 3333.140233 | 345764.275012 | MULTIPOLYGON (((1884831.019 5593384.334, 18848... |
| 1 | 033 | New Plymouth District | New Plymouth District | 2205.597503 | 2205.597503 | 376935.584957 | MULTIPOLYGON (((1687622.432 5675977.675, 16876... |
| 2 | 034 | Stratford District | Stratford District | 2163.422393 | 2163.422393 | 410331.506021 | MULTIPOLYGON (((1761018.863 5693193.903, 17610... |
# Checking the number of rows (TAs) and columns (attributes)
ta.shape
We have 68 rows and 7 columns. Therefore, 68 territorial authorities and seven attributes. Visualize the territorial authority data using an interactive map:
# YOUR CODE HERE. YOUR OUTPUT SHOULD LOOK LIKE THE ONE BELOW. Zoom in to New Zealand and hover the mouse over the polygons on the map; your screen should look like the figure below
We will use one of the attributes (TA2021_V1_00, a code identifying each TA) you saw when hovering over the map to perform a selection by attribute that removes the geometries representing the areas outside any territorial authority (the maritime area around the coast) and the Chatham Islands:
# Here we use .isin() to select the rows that have TA2021_V1_00 == 999 (maritime area) or 67 (Chatham Islands)
# By placing ~ in front of the expression (ta.TA2021_V1_00).isin([999,67]) we negate it; it is equivalent to placing a "not"
# in front of it, which means that we are selecting the TAs that DO NOT have TA2021_V1_00 == 999 (maritime area) or 67 (Chatham Islands)
ta=ta[~(ta.TA2021_V1_00).isin([999,67])]
Check the number of rows to see if our selection worked:
We still have the same number of rows; therefore our selection did not work and we need to investigate the reason. This also happens often, and usually the reason is a mismatch between data types. In our selection above, 999 and 67 are integer numbers, and they will only be matched in the column TA2021_V1_00 if that column’s data type is integer. We can check the data type of a column using the dtype property:
# The syntax is DataFrame.column.dtype
ta.TA2021_V1_00.dtype
The selection did not work because the column TA2021_V1_00 data type is object (‘O’). In order to select the data we need to convert the data type of this column to integer. We will do this conversion on the fly while also selecting the data using the code below:
# .astype(int) is converting ta.TA2021_V1_00 from object to integer
ta=ta[~(ta.TA2021_V1_00.astype(int)).isin([999,67])]
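As an aside, the same pitfall can be reproduced with a toy Series of string codes (values are made up):

```python
import pandas as pd

codes = pd.Series(["047", "067", "999"])  # stored as strings, so dtype is object

# Comparing strings against integers matches nothing, so nothing would be excluded
print(codes.isin([999, 67]).sum())  # 0

# Converting to int first makes the membership test behave as intended
print(codes.astype(int).isin([999, 67]).sum())  # 2
```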
Check the number of rows to see if our selection worked:
We have 66 rows now, therefore the selection worked. Two rows (TAs) were removed from our data as we originally had 68. Now, let’s visualize the data again:
Now let’s find out how many cats we have in our data set for the remaining territorial units.
Attribute join
An attribute join, also known as an attribute-based join or non-spatial join, is a type of data merging operation performed on datasets, typically in tabular form (e.g., DataFrames). Unlike spatial joins that involve geometric relationships, attribute joins merge data based on shared attribute values, usually a common column or key, in the datasets being merged.
In an attribute join, matching values in the specified columns of the datasets serve as the criteria for combining the data. Rows with identical attribute values in the join columns from both datasets are merged, resulting in a new dataset that combines the attributes from both original datasets.
The pd.merge function in pandas is used to merge two DataFrames based on common columns or indices. It allows you to combine data from different DataFrames into a single DataFrame using various types of joins.
The syntax for pd.merge is as follows:
result = pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None)
- left and right: The two DataFrames that you want to merge.
- how: The type of join to perform. It can be ‘inner’ (default), ‘left’, ‘right’, or ‘outer’. It specifies which rows from both DataFrames to include in the result.
- on: The column or index name in both DataFrames that serves as the key for the merge. If the DataFrames have the same column name, you can specify it using this parameter. Alternatively, you can use left_on and right_on to specify different column names in each DataFrame.
- left_on and right_on: The column names in the left and right DataFrames, respectively, that are used as keys for the merge when they have different names.
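A small schematic merge with two made-up tables (column and value names are illustrative):

```python
import pandas as pd

ta_toy = pd.DataFrame({"code": ["032", "047"], "name": ["CHB", "Wellington"]})
counts = pd.DataFrame({"code": ["047"], "cats_count": [127]})

# A left join keeps every row of ta_toy; codes without a matching count get NaN
merged = pd.merge(ta_toy, counts, how="left", on="code")
print(merged)
```

With how='inner' the CHB row would be dropped instead of being kept with a NaN count.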
Now we will use the function .merge() from pandas to combine our cat count per TA unit (cats_tas) with the TA GeoDataFrame (ta):
# we join the datasets based on the TA code (TA2021_V1_00)
ta_cat_count= ta.merge(cats_ta, how= "left", on='TA2021_V1_00')
# Inspect the result
ta_cat_count.head(3)
|   | TA2021_V1_00 | TA2021_V1_00_NAME | TA2021_V1_00_NAME_ASCII | LAND_AREA_SQ_KM | AREA_SQ_KM | Shape_Length | geometry | cats_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 032 | Central Hawke's Bay District | Central Hawke's Bay District | 3333.140233 | 3333.140233 | 345764.275012 | MULTIPOLYGON (((1884831.019 5593384.334, 18848... | NaN |
| 1 | 033 | New Plymouth District | New Plymouth District | 2205.597503 | 2205.597503 | 376935.584957 | MULTIPOLYGON (((1687622.432 5675977.675, 16876... | NaN |
| 2 | 034 | Stratford District | Stratford District | 2163.422393 | 2163.422393 | 410331.506021 | MULTIPOLYGON (((1761018.863 5693193.903, 17610... | NaN |
Now we can visualize the number of cats per territorial authority:
ta_cat_count.explore(column= 'cats_count')
The territorial authority with the most cats is Wellington with 127 cats in the tracking study. Explore the map to find the TAs with the second and third highest cat count.
Spatial clip
Spatial clip, also known as clipping or spatial subset, is a geospatial operation used to extract a portion of a dataset that intersects or falls within the boundaries of another dataset. It involves defining a boundary or polygon from one dataset (clip features) and then using it to extract the spatially coincident portion of another dataset (input features).
The resulting output contains only the features from the input dataset that intersect with the boundary defined by the clip features. The non-intersecting features are excluded from the output.
We will perform a spatial clip to focus our analysis only on cats within the Wellington TA. For that, we first need to define our clip feature by selecting from the ta dataframe only the geometry that defines Wellington’s TA (code 047):
welli=ta[ta['TA2021_V1_00'] == '047']
# Inspect the result
welli
|    | TA2021_V1_00 | TA2021_V1_00_NAME | TA2021_V1_00_NAME_ASCII | LAND_AREA_SQ_KM | AREA_SQ_KM | Shape_Length | geometry |
|---|---|---|---|---|---|---|---|
| 15 | 047 | Wellington City | Wellington City | 289.910046 | 290.04623 | 130606.342983 | MULTIPOLYGON (((1748467.579 5420994.674, 17484... |
# Visualize the resulting geometry
welli.explore()
Now we can use the polygon shown on the map above to select only cats living in Wellington. For that we apply the function .clip, which follows the syntax below:
result = gpd.clip(input_features, clip_features)
- input_features: the data to be clipped
- clip_features: the boundary or limiting polygon
cats_welli = gpd.clip(cat_multi, welli)
Visualize the result with a different color per cat and compare it with the multipoint data we mapped earlier:
cats_welli.explore(column='individual-local-identifier', legend= False)
Convex hull
A convex hull is the smallest convex polygon or polyhedron that contains all the given points in a set, ensuring that no points lie outside the polygon’s boundaries. The convex hull helps define the outermost boundary of the point set and is a fundamental concept used in computational geometry and spatial analysis. In animal ecology, the convex hull is one of the simplest methods to estimate the home range of an animal based on its movement data. A home range is the area that an animal regularly uses for its daily activities, such as foraging, resting, and mating.
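With a handful of made-up fixes the idea looks like this:

```python
from shapely.geometry import MultiPoint

# Five illustrative GPS fixes; the point (2, 1) lies inside the outer four
fixes = MultiPoint([(0, 0), (4, 0), (4, 3), (0, 3), (2, 1)])

hull = fixes.convex_hull  # the smallest convex polygon containing all points
print(hull.area)  # 12.0, the 4 x 3 rectangle through the outer points
```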
Let’s calculate convex hulls to estimate home ranges for cats in Wellington:
home_ranges= cats_welli.convex_hull
# Inspect the result
home_ranges.head(5)
118 POLYGON ((1748263.308 5421633.081, 1748250.543...
140 POLYGON ((1748810.635 5422040.497, 1748491.927...
182 POLYGON ((1749559.658 5422190.368, 1749565.505...
187 POLYGON ((1749697.467 5422304.927, 1749663.551...
56 POLYGON ((1748314.644 5422273.124, 1748267.598...
dtype: geometry
Let’s join the home_ranges geometry to the individual-local-identifier column from cats_welli, so that we can visualize the home ranges colored by cat name (individual-local-identifier):
# Before joining we need to give a name to the home_ranges series.
# This name will become the column name once we join it to the GeoDataFrame
home_ranges= home_ranges.rename('hr')
# Here we use left_index=True and right_index=True to perform the join on the indices (row labels).
home_ranges= pd.merge(home_ranges, cats_welli[["individual-local-identifier"]], left_index=True, right_index=True, how='inner')
# Inspect the result
home_ranges
|     | hr | individual-local-identifier |
|---|---|---|
| 118 | POLYGON ((1748263.308 5421633.081, 1748250.543... | Minerva |
| 140 | POLYGON ((1748810.635 5422040.497, 1748491.927... | Ollie |
| 182 | POLYGON ((1749559.658 5422190.368, 1749565.505... | Sleeky |
| 187 | POLYGON ((1749697.467 5422304.927, 1749663.551... | Sooty |
| 56 | POLYGON ((1748314.644 5422273.124, 1748267.598... | Floss |
| ... | ... | ... |
| 115 | POLYGON ((1753103.550 5441000.411, 1753004.313... | Milkshakes |
| 20 | POLYGON ((1754246.361 5441080.201, 1754127.058... | Bella |
| 43 | POLYGON ((1754590.946 5441205.085, 1754529.970... | Cookie |
| 8 | POLYGON ((1754484.694 5441209.445, 1754465.491... | Arya |
| 189 | POLYGON ((1752998.898 5441569.502, 1752802.834... | Speckles1 |

127 rows × 2 columns
The hr column stores the home range polygons, but GeoPandas does not yet know that this is the column holding the active geometry. We need to tell Python that hr is the column that contains the geometries. In geopandas, the method set_geometry() is used to set or change the active geometry column in a GeoDataFrame. It allows you to specify a column that contains geometric objects (e.g., points, lines, or polygons) to be used as the new geometry column for spatial operations.
home_ranges.set_geometry('hr', inplace=True)
Let’s also rename the individual-local-identifier column to cat. We can rename a column in a DataFrame using the rename method or by directly assigning a new name to the column:
# Rename the 'individual-local-identifier' column to 'cat'
home_ranges.rename(columns={'individual-local-identifier': 'cat'}, inplace=True)
# Inspect the result
home_ranges
|     | hr | cat |
|---|---|---|
| 118 | POLYGON ((1748263.308 5421633.081, 1748250.543... | Minerva |
| 140 | POLYGON ((1748810.635 5422040.497, 1748491.927... | Ollie |
| 182 | POLYGON ((1749559.658 5422190.368, 1749565.505... | Sleeky |
| 187 | POLYGON ((1749697.467 5422304.927, 1749663.551... | Sooty |
| 56 | POLYGON ((1748314.644 5422273.124, 1748267.598... | Floss |
| ... | ... | ... |
| 115 | POLYGON ((1753103.550 5441000.411, 1753004.313... | Milkshakes |
| 20 | POLYGON ((1754246.361 5441080.201, 1754127.058... | Bella |
| 43 | POLYGON ((1754590.946 5441205.085, 1754529.970... | Cookie |
| 8 | POLYGON ((1754484.694 5441209.445, 1754465.491... | Arya |
| 189 | POLYGON ((1752998.898 5441569.502, 1752802.834... | Speckles1 |

127 rows × 2 columns
# Visualize home ranges
home_ranges.explore(column='cat', legend=False)
[Interactive map showing the cats’ home range polygons, coloured by cat]
Calculating areas
The properties we saw for Shapely objects are still valid for each row in a Geopandas DataFrame. We can use the property .area to calculate the home range areas as a new attribute in our data:
# We calculate the area based on the hr column because that is the one with geometries
home_ranges['area_hr'] = home_ranges['hr'].area
# Inspect the results
home_ranges
| | hr | cat | area_hr |
| --- | --- | --- | --- |
| 118 | POLYGON ((1748263.308 5421633.081, 1748250.543... | Minerva | 18115.842159 |
| 140 | POLYGON ((1748810.635 5422040.497, 1748491.927... | Ollie | 83506.149725 |
| 182 | POLYGON ((1749559.658 5422190.368, 1749565.505... | Sleeky | 35911.757197 |
| 187 | POLYGON ((1749697.467 5422304.927, 1749663.551... | Sooty | 5670.431525 |
| 56 | POLYGON ((1748314.644 5422273.124, 1748267.598... | Floss | 40725.647160 |
| ... | ... | ... | ... |
| 115 | POLYGON ((1753103.550 5441000.411, 1753004.313... | Milkshakes | 23977.104623 |
| 20 | POLYGON ((1754246.361 5441080.201, 1754127.058... | Bella | 35446.276993 |
| 43 | POLYGON ((1754590.946 5441205.085, 1754529.970... | Cookie | 15505.695671 |
| 8 | POLYGON ((1754484.694 5441209.445, 1754465.491... | Arya | 9180.415535 |
| 189 | POLYGON ((1752998.898 5441569.502, 1752802.834... | Speckles1 | 85136.633400 |

127 rows × 3 columns
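What .area computes for each row is Shapely’s planar area in squared CRS units. A minimal sketch, using a hypothetical 100 m × 100 m square rather than the cat data:

```python
from shapely.geometry import Polygon

# A hypothetical 100 m x 100 m square in a projected CRS such as EPSG:2193
square = Polygon([(0, 0), (100, 0), (100, 100), (0, 100)])
print(square.area)  # 10000.0 (square metres, if coordinates are in metres)
```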
We calculated the home range areas, but what are the units for these areas? We can find that information by checking the CRS of our data:
# Check the coordinate reference system of the GeoDataFrame
home_ranges.crs
<Projected CRS: EPSG:2193>
Name: NZGD2000 / New Zealand Transverse Mercator 2000
Axis Info [cartesian]:
- N[north]: Northing (metre)
- E[east]: Easting (metre)
Area of Use:
- name: New Zealand - North Island, South Island, Stewart Island - onshore.
- bounds: (166.37, -47.33, 178.63, -34.1)
Coordinate Operation:
- name: New Zealand Transverse Mercator 2000
- method: Transverse Mercator
Datum: New Zealand Geodetic Datum 2000
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
As you can see above under Axis Info [cartesian]:, this CRS uses metres as the measurement unit. If we want the area to be in \(km^{2}\) we need to divide it by \(10^{6}\):
home_ranges['area_hr_km'] = home_ranges['area_hr']/1000000
# Inspect the results
home_ranges
| | hr | cat | area_hr | area_hr_km |
| --- | --- | --- | --- | --- |
| 118 | POLYGON ((1748263.308 5421633.081, 1748250.543... | Minerva | 18115.842159 | 0.018116 |
| 140 | POLYGON ((1748810.635 5422040.497, 1748491.927... | Ollie | 83506.149725 | 0.083506 |
| 182 | POLYGON ((1749559.658 5422190.368, 1749565.505... | Sleeky | 35911.757197 | 0.035912 |
| 187 | POLYGON ((1749697.467 5422304.927, 1749663.551... | Sooty | 5670.431525 | 0.005670 |
| 56 | POLYGON ((1748314.644 5422273.124, 1748267.598... | Floss | 40725.647160 | 0.040726 |
| ... | ... | ... | ... | ... |
| 115 | POLYGON ((1753103.550 5441000.411, 1753004.313... | Milkshakes | 23977.104623 | 0.023977 |
| 20 | POLYGON ((1754246.361 5441080.201, 1754127.058... | Bella | 35446.276993 | 0.035446 |
| 43 | POLYGON ((1754590.946 5441205.085, 1754529.970... | Cookie | 15505.695671 | 0.015506 |
| 8 | POLYGON ((1754484.694 5441209.445, 1754465.491... | Arya | 9180.415535 | 0.009180 |
| 189 | POLYGON ((1752998.898 5441569.502, 1752802.834... | Speckles1 | 85136.633400 | 0.085137 |

127 rows × 4 columns
Generating centroids
Centroids are central points, defined by x and y coordinates, that identify the centre of a line, multipoint, or polygon and are often used to represent more complex shapes with a single point. The centroid of a polygon or line can fall outside the feature if the geometry has a complex shape, like a U-shape, or if it has holes. In these cases, the function representative_point, which borrows the Shapely method of the same name, can be used to get a point that is guaranteed to be inside the geometry. For more details on representative_point, read this.
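The difference between centroid and representative_point matters for concave shapes. A small Shapely sketch with a hypothetical U-shaped polygon:

```python
from shapely.geometry import Polygon

# A U-shaped polygon: the notch in the middle pushes the centroid outside the shape
u_shape = Polygon([(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)])

print(u_shape.centroid.within(u_shape))                # False: the centroid falls in the gap
print(u_shape.representative_point().within(u_shape))  # True: guaranteed inside the geometry
```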
In point pattern analysis, the centroid of a cluster of points can be used to analyze the overall distribution or spatial pattern of the points. We will use the property .centroid to extract a representative point for each home range:
cats_centroids = home_ranges.centroid
# Inspect the result
cats_centroids
118 POINT (1748299.824 5421694.181)
140 POINT (1748722.895 5422235.345)
182 POINT (1749660.747 5422333.766)
187 POINT (1749697.611 5422349.247)
56 POINT (1748322.386 5422429.621)
...
115 POINT (1753083.622 5441117.919)
20 POINT (1754256.711 5441224.030)
43 POINT (1754532.692 5441290.232)
8 POINT (1754536.232 5441287.349)
189 POINT (1753005.002 5441762.697)
Length: 127, dtype: geometry
# Before joining we need to give a name to the cats_centroids series.
# This name will become the column name once we join it to the GeoDataFrame
cats_centroids = cats_centroids.rename('centroid')
# Here we use left_index=True and right_index=True to perform the join on the indices (row labels).
cats_centroid = pd.merge(cats_centroids, home_ranges[["cat"]], left_index=True, right_index=True, how='inner')
# as we did before with convex hull, we need to set the geometry column
cats_centroid.set_geometry('centroid', inplace=True)
# Inspect the result
cats_centroid
| | centroid | cat |
| --- | --- | --- |
| 118 | POINT (1748299.824 5421694.181) | Minerva |
| 140 | POINT (1748722.895 5422235.345) | Ollie |
| 182 | POINT (1749660.747 5422333.766) | Sleeky |
| 187 | POINT (1749697.611 5422349.247) | Sooty |
| 56 | POINT (1748322.386 5422429.621) | Floss |
| ... | ... | ... |
| 115 | POINT (1753083.622 5441117.919) | Milkshakes |
| 20 | POINT (1754256.711 5441224.030) | Bella |
| 43 | POINT (1754532.692 5441290.232) | Cookie |
| 8 | POINT (1754536.232 5441287.349) | Arya |
| 189 | POINT (1753005.002 5441762.697) | Speckles1 |

127 rows × 2 columns
Visualize the resulting centroids using an interactive map, colouring by cat name:
#YOUR CODE HERE. YOUR OUTPUT SHOULD LOOK LIKE THE MAP BELOW. SET legend= False
[Interactive map showing the home range centroids, coloured by cat]
Buffer
Buffering is a common spatial analysis operation used to create a zone or area of influence around a geographic feature. A buffer is a zone that is generated at a specified distance from a spatial object, such as a point, line, or polygon.
Potential applications for buffers are:
- Proximity Analysis: Buffering is often used for proximity analysis, allowing GIS analysts to study the influence of one geographic feature on others. For example, creating a buffer around a river can help identify areas that might be affected by flooding.
- Spatial Query: Buffering enables spatial querying by identifying features that fall within a certain distance from another feature. For instance, you can use a buffer to find all houses within a 1-mile radius from a school.
- Environmental Impact Assessment: In environmental studies, buffering helps assess the impact of human activities on natural habitats. For instance, you can buffer an area around a wildlife sanctuary to see if any human activities encroach upon it.
- Accessibility Analysis: Buffering is useful in accessibility studies to determine areas that fall within a specified distance from certain facilities, like hospitals, schools, or public transportation.
- Safety and Security: Buffers are applied in safety and security analyses. For example, creating a buffer around a hazardous area helps plan evacuation zones in case of emergencies.
- Service Areas: Buffers can define service areas for facilities like gas stations, restaurants, or delivery services. They help analyze how far customers are willing to travel to access these services.
Buffers are generated using the function .buffer(), which follows the syntax below:
GeoSeries.buffer(distance, resolution=16)
- distance: This parameter specifies the distance (in the same units as the GeoSeries’ coordinate reference system) by which each geometry in the GeoSeries will be buffered. The buffer operation creates a zone around each geometry, and distance determines the size of this buffer zone.
- resolution: This optional parameter sets the number of segments used to approximate the circular arcs that form the buffer geometry. Higher values of resolution lead to smoother and more accurate buffer shapes, but also increase the computational complexity.
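The effect of resolution can be seen by comparing buffer areas against the area of a true circle. A sketch with a hypothetical point at the origin:

```python
import math
from shapely.geometry import Point

pt = Point(0, 0)
coarse = pt.buffer(400, resolution=4)   # 4 segments per quarter circle -> 16-sided polygon
smooth = pt.buffer(400, resolution=20)  # 20 segments per quarter circle -> 80-sided polygon
true_area = math.pi * 400 ** 2          # area of an exact circle of radius 400

# The buffer polygons are inscribed in the circle, so they slightly underestimate
# its area; more segments get closer to the true value
print(round(coarse.area / true_area, 3))  # ~0.974
print(round(smooth.area / true_area, 3))  # ~0.999
```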
Some papers suggest that domestic cats can travel 0.4 km per day. We will use this distance as the parameter for our buffer. We will create buffers around the home range centroids to perform a spatial query. Our goal is to find out how many people in Wellington are living within the area of influence of the cats in the study.
# As the CRS for cats_centroid uses meters as unit, we need
# to express the distance in the same units
cats_buffer = cats_centroid.buffer(400, resolution=20)
# Inspect the results
cats_buffer
118 POLYGON ((1748699.824 5421694.181, 1748698.590...
140 POLYGON ((1749122.895 5422235.345, 1749121.662...
182 POLYGON ((1750060.747 5422333.766, 1750059.514...
187 POLYGON ((1750097.611 5422349.247, 1750096.378...
56 POLYGON ((1748722.386 5422429.621, 1748721.153...
...
115 POLYGON ((1753483.622 5441117.919, 1753482.389...
20 POLYGON ((1754656.711 5441224.030, 1754655.478...
43 POLYGON ((1754932.692 5441290.232, 1754931.459...
8 POLYGON ((1754936.232 5441287.349, 1754934.999...
189 POLYGON ((1753405.002 5441762.697, 1753403.769...
Length: 127, dtype: geometry
Visualize all buffers:
[Interactive map showing the 400 m buffers around the centroids]
Unary union
In Geopandas, the unary_union attribute is used to merge multiple geometric objects (e.g., points, lines, or polygons) within a GeoSeries into a single, unified geometry. We will use it to combine all buffers into a single multipolygon:
cat_zone = cats_buffer.unary_union
# Inspect the result
cat_zone
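To see what a unary union does at the geometry level, here is a small Shapely sketch with hypothetical circles (two overlapping, one separate):

```python
from shapely.geometry import Point
from shapely.ops import unary_union

a = Point(0, 0).buffer(1)    # overlaps b
b = Point(1, 0).buffer(1)
c = Point(10, 0).buffer(1)   # far away from both

merged = unary_union([a, b, c])
print(merged.geom_type)   # MultiPolygon: a and b dissolve into one blob, c stays separate
print(len(merged.geoms))  # 2
```

The same dissolving happens with the cat buffers: wherever buffers overlap, their shared boundaries disappear.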
Check the data type of the object created by unary_union:
print(type(cat_zone))
<class 'shapely.geometry.multipolygon.MultiPolygon'>
We have a <class 'shapely.geometry.multipolygon.MultiPolygon'> object. In order to visualize it with .explore() and define a CRS for further analysis we need to insert it into a GeoDataFrame:
influence_cat= gpd.GeoDataFrame(data={'Name': ['domestic cats']}, geometry = [cat_zone], crs= cats_buffer.crs)
The line of code above creates a new GeoDataFrame named influence_cat using three main parameters:
- data: A dictionary with the ‘Name’ column containing a single entry ‘domestic cats’, which will be the label for our unified buffers
- geometry: A list containing a single Shapely geometry object cat_zone, which represents the zone of influence of domestic cats.
- crs: The coordinate reference system (CRS) for the GeoDataFrame. It is set to cats_buffer.crs, i.e. it inherits the CRS from the original GeoDataFrame cats_buffer.
Let’s visualize the result to see if it is in the correct location:
[Interactive map showing the unified cat influence zone]
Overlay
All in the correct place! Now we can intersect it with population data to see how many people are in the cats’ influence zone. First we need to read the geopackage with population data from the NZ Census.
In the Lab2_data folder, which you downloaded from Learn, you will find a geopackage file named “censu_2018_pop.gpkg”. In your JupyterHub, upload “censu_2018_pop.gpkg” to the folder named Lab2_data before attempting the next steps. Go back to Lab 1 section 4.3 if you do not remember how to upload data to JupyterHub.
Run the code below after uploading the data. It might take a minute to run.
# Census population 2018 data for NZ
population = gpd.read_file("Lab2_data/censu_2018_pop.gpkg")
# Inspect result
population.head(3)
| | SA12018_V1_00 | C18_CURPop | geometry |
| --- | --- | --- | --- |
| 0 | 7000006 | 105 | MULTIPOLYGON (((1605432.509 6159874.213, 16054... |
| 1 | 7000010 | 0 | MULTIPOLYGON (((1607386.316 6153776.189, 16073... |
| 2 | 7000033 | 138 | MULTIPOLYGON (((1636338.253 6150839.030, 16363... |
We will use the function .overlay() to find which Census blocks are within the cats’ influence zone. Overlay refers to combining spatial datasets using operations like intersection, union, or difference to create new layers, revealing spatial relationships and shared areas between features.
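At the level of a single pair of geometries, an intersection overlay keeps only the shared area. A sketch with two hypothetical squares:

```python
from shapely.geometry import box

block = box(0, 0, 2, 2)   # stands in for a census polygon
zone = box(1, 1, 3, 3)    # stands in for the influence zone

clipped = block.intersection(zone)
print(clipped.area)  # 1.0: only the 1 x 1 overlap survives
```

gpd.overlay() applies this pairwise intersection across two whole GeoDataFrames and combines the attribute tables of the intersecting features.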
# The keep_geom_type parameter is set to False to allow mixed-geometry types in the result
population = gpd.overlay(population, influence_cat, how='intersection', keep_geom_type=False)
# Inspect the result
population.head(3)
| | SA12018_V1_00 | C18_CURPop | Name | geometry |
| --- | --- | --- | --- | --- |
| 0 | 7021694 | 273 | domestic cats | MULTIPOLYGON (((1750542.367 5433780.542, 17505... |
| 1 | 7021999 | 216 | domestic cats | POLYGON ((1748699.206 5423083.308, 1748661.956... |
| 2 | 7022056 | 87 | domestic cats | MULTIPOLYGON (((1749871.362 5423602.663, 17498... |
Let’s visualize the overlay result:
[Interactive map showing the SA1 polygons clipped to the influence zone]
Zoom in to the features on the map above and you will see that the zones defined by our buffers are now subdivided into smaller polygons. These polygons are the SA1s (statistical area 1 units) that intersected the zone defined by the unified buffer.
You can also see, by hovering the mouse over a feature, that each feature has an SA1 code (SA12018_V1_00) and a population count (C18_CURPop) that came from the geopackage censu_2018_pop.gpkg. We also have the column Name, which came from our influence_cat buffer.
We can calculate the total population within the cats’ influence area by adding up the population count (C18_CURPop) for all SA1 polygons shown on the map above.
population['C18_CURPop'].sum()
A total of 149,916 people reside within the cats’ influence zone as depicted in our datasets.